log(carat) vs relative price represent? What does the diagonal line without any points represent?

Consider the linear relationship between lcarat and lprice:
diamonds2 <- diamonds %>%
  mutate(
    lcarat = log2(carat),
    lprice = log2(price)
  )
diamonds2
## # A tibble: 53,940 x 12
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39
## # ... with 53,930 more rows, and 2 more variables: lcarat <dbl>,
## # lprice <dbl>
ggplot(diamonds2, aes(lcarat, lprice)) +
  geom_bin2d() +
  geom_smooth(method = "lm", se = FALSE, size = 2, color = "yellow")
We can see above that the x-axis now extends further to the right than in the original plot (which was limited to diamonds of 2 carats or less). The range of lprice did not increase.
mod <- lm(lprice ~ lcarat, data = diamonds2)
diamonds2 <- diamonds2 %>% mutate(rel_price = resid(mod))
diamonds2
## # A tibble: 53,940 x 13
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39
## # ... with 53,930 more rows, and 3 more variables: lcarat <dbl>,
## # lprice <dbl>, rel_price <dbl>
ggplot(diamonds2, aes(carat, rel_price)) + geom_bin2d()
The above plot is problematic: it shows a distinct pattern in the residuals. The linear model overpredicts the log price for large carats, and the variance of the residuals is noticeably higher for smaller carats. Looking back at the plot of the linear model, the fitted line extends beyond the range of lprice on the right, so in that region all points lie below the line (which is why we see only negative residuals there).
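As a rough check of the variance claim, we could bin diamonds by size and compare the spread of rel_price in each bin. A sketch; the 1-carat cutoff is an arbitrary illustrative choice:

```r
# Sketch: compare residual spread for small vs large diamonds.
# The 1-carat cutoff is an arbitrary choice, not from the original analysis.
diamonds2 %>%
  mutate(size = ifelse(carat < 1, "small (< 1 ct)", "large (>= 1 ct)")) %>%
  group_by(size) %>%
  summarize(n = n(), sd_rel_price = sd(rel_price))
```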
color_cut <- diamonds2 %>%
  group_by(color, cut) %>%
  summarize(
    price = mean(price),
    rel_price = mean(rel_price)
  )
color_cut
## # A tibble: 35 x 4
## # Groups: color [?]
## color cut price rel_price
## <ord> <ord> <dbl> <dbl>
## 1 D Fair 4291.061 -0.06695865
## 2 D Good 3405.382 -0.03902942
## 3 D Very Good 3470.467 0.10763401
## 4 D Premium 3631.293 0.11368869
## 5 D Ideal 2629.095 0.21605978
## 6 E Fair 3682.312 -0.16154694
## 7 E Good 3423.644 -0.04676044
## 8 E Very Good 3214.652 0.06901578
## 9 E Premium 3538.914 0.08742779
## 10 E Ideal 2597.550 0.17323162
## # ... with 25 more rows
ggplot(color_cut, aes(color, price)) +
  geom_line(aes(group = cut), color = "grey80") +
  geom_point(aes(color = cut))
Above we see the same trend as in the book, where the lowest-quality diamonds (color = J) have the highest price, especially for Premium cut diamonds. Plotting relative price against color solves the issue.
ggplot(color_cut, aes(color, rel_price)) +
  geom_line(aes(group = cut), color = "grey80") +
  geom_point(aes(color = cut))
ggplot(diamonds, aes(color, carat)) + geom_boxplot()
color_cut_clarity <- diamonds2 %>%
  group_by(color, cut, clarity) %>%
  summarize(
    price = mean(price),
    rel_price = mean(rel_price)
  )
ggplot(color_cut_clarity, aes(color, rel_price)) +
  geom_line(aes(group = cut), color = "grey80") +
  geom_point(aes(color = cut)) +
  facet_wrap(~clarity)
Table and depth do not have much relationship with relative price. This makes sense, as table and depth are highly correlated with carat (weight), and we've already taken out the effect of carat.
ggplot(diamonds2, aes(depth, rel_price)) +
  geom_bin2d()
ggplot(diamonds2, aes(table, rel_price)) +
  geom_bin2d()
Instead of using geom_line, we can draw a loess curve using geom_smooth:
txhousing
## # A tibble: 8,602 x 9
## city year month sales volume median listings inventory date
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene 2000 1 72 5380000 71400 701 6.3 2000.000
## 2 Abilene 2000 2 98 6505000 58700 746 6.6 2000.083
## 3 Abilene 2000 3 130 9285000 58100 784 6.8 2000.167
## 4 Abilene 2000 4 98 9730000 68600 785 6.9 2000.250
## 5 Abilene 2000 5 141 10590000 67300 794 6.8 2000.333
## 6 Abilene 2000 6 156 13910000 66900 780 6.6 2000.417
## 7 Abilene 2000 7 152 12635000 73500 742 6.2 2000.500
## 8 Abilene 2000 8 131 10710000 75000 765 6.4 2000.583
## 9 Abilene 2000 9 104 7615000 64500 771 6.5 2000.667
## 10 Abilene 2000 10 101 7040000 59300 764 6.6 2000.750
## # ... with 8,592 more rows
# De-season function: remove the monthly seasonal effect by taking residuals
deseas <- function(x, month) {
  resid(lm(x ~ factor(month), na.action = na.exclude))
}
txhousing <- txhousing %>%
  group_by(city) %>%
  mutate(rel_sales = deseas(log(sales), month))
ggplot(txhousing, aes(date, rel_sales)) +
  geom_line(aes(group = city), alpha = 1/5) +
  #geom_line(stat = "summary", fun.y = "mean", color = "red")
  geom_smooth(method = "loess", se = FALSE)
## Warning: Removed 568 rows containing non-finite values (stat_smooth).
## Warning: Removed 430 rows containing missing values (geom_path).
`+ xlim(2008, 2012)`) at the long-term trend you'll notice a weird pattern in 2009-2011. It looks like there was a big dip in 2010. Is this dip "real"? (i.e. can you spot it in the original data?)

Looking at the relative sales (after taking out the month effect), a number of cities do indeed tend to have a dip in 2010:
ggplot(txhousing, aes(date, rel_sales)) +
  geom_line(aes(group = city), alpha = 1/5) +
  geom_line(stat = "summary", fun.y = "mean", color = "red") +
  #geom_smooth(method = "loess", se = FALSE) +
  xlim(2008, 2012)
## Warning: Removed 6377 rows containing non-finite values (stat_summary).
## Warning: Removed 6376 rows containing missing values (geom_path).
summary(txhousing)
## city year month sales
## Length:8602 Min. :2000 Min. : 1.000 Min. : 6.0
## Class :character 1st Qu.:2003 1st Qu.: 3.000 1st Qu.: 86.0
## Mode :character Median :2007 Median : 6.000 Median : 169.0
## Mean :2007 Mean : 6.406 Mean : 549.6
## 3rd Qu.:2011 3rd Qu.: 9.000 3rd Qu.: 467.0
## Max. :2015 Max. :12.000 Max. :8945.0
## NA's :568
## volume median listings inventory
## Min. :8.350e+05 Min. : 50000 Min. : 0 Min. : 0.000
## 1st Qu.:1.084e+07 1st Qu.:100000 1st Qu.: 682 1st Qu.: 4.900
## Median :2.299e+07 Median :123800 Median : 1283 Median : 6.200
## Mean :1.069e+08 Mean :128131 Mean : 3217 Mean : 7.175
## 3rd Qu.:7.512e+07 3rd Qu.:150000 3rd Qu.: 2954 3rd Qu.: 8.150
## Max. :2.568e+09 Max. :304200 Max. :43107 Max. :55.900
## NA's :568 NA's :616 NA's :1424 NA's :1467
## date rel_sales
## Min. :2000 Min. :-1.8473
## 1st Qu.:2004 1st Qu.:-0.1404
## Median :2008 Median : 0.0117
## Mean :2008 Mean : 0.0000
## 3rd Qu.:2012 3rd Qu.: 0.1531
## Max. :2016 Max. : 0.9080
## NA's :568
#"Months inventory": amount of time it would take to sell all current listings at current pace of sales.
ggplot(txhousing, aes(date, log(inventory))) +
  geom_line(aes(group = city), alpha = 1/2)
## Warning: Removed 603 rows containing missing values (geom_path).
#Total active listings
ggplot(txhousing, aes(date, log(listings))) +
  geom_line(aes(group = city), alpha = 1/2)
## Warning: Removed 518 rows containing missing values (geom_path).
#Total value of sales
ggplot(txhousing, aes(date, log(volume))) +
  geom_line(aes(group = city), alpha = 1/2)
## Warning: Removed 430 rows containing missing values (geom_path).
#Median sale price:
ggplot(txhousing, aes(date, log(median))) +
  geom_line(aes(group = city), alpha = 1/2)
## Warning: Removed 446 rows containing missing values (geom_path).
dplyr skills to figure out how much data each city is missing. Display the results with a visualization.

numDates <- txhousing %>%
  summarize(nDates = n_distinct(date)) %>%
  select(nDates) %>%
  max()
numDates
## [1] 187
cityCnts <- txhousing %>%
  na.omit() %>%
  group_by(city) %>%
  summarize(nComplete = n()) %>%
  mutate(pctComplete = nComplete / numDates)
cityCnts
## # A tibble: 46 x 3
## city nComplete pctComplete
## <chr> <int> <dbl>
## 1 Abilene 186 0.9946524
## 2 Amarillo 182 0.9732620
## 3 Arlington 186 0.9946524
## 4 Austin 187 1.0000000
## 5 Bay Area 186 0.9946524
## 6 Beaumont 187 1.0000000
## 7 Brazoria County 129 0.6898396
## 8 Brownsville 82 0.4385027
## 9 Bryan-College Station 187 1.0000000
## 10 Collin County 186 0.9946524
## # ... with 36 more rows
ggplot(cityCnts, aes(city, pctComplete)) +
  geom_col() +
  labs(x = "City", y = "% Complete") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
stat_summary() did with dplyr so you can plot the data "by hand".

Comment: I think this question is referring to the geom_line shown on page 229 using stat = "summary" and fun.y = "mean".
meanRelSalesData <- txhousing %>%
  group_by(date) %>%
  summarize(meanRelSales = mean(rel_sales, na.rm = TRUE))
meanRelSalesData
## # A tibble: 187 x 2
## date meanRelSales
## <dbl> <dbl>
## 1 2000.000 -0.2625059
## 2 2000.083 -0.1535309
## 3 2000.167 -0.1775694
## 4 2000.250 -0.3273133
## 5 2000.333 -0.2263179
## 6 2000.417 -0.2185162
## 7 2000.500 -0.3047979
## 8 2000.583 -0.1751718
## 9 2000.667 -0.2235483
## 10 2000.750 -0.2263248
## # ... with 177 more rows
ggplot(txhousing, aes(date, rel_sales)) +
  geom_line(aes(group = city), alpha = 1/5) +
  geom_line(aes(date, meanRelSales), data = meanRelSalesData, color = "red")
## Warning: Removed 430 rows containing missing values (geom_path).
#install.packages("broom")
library(broom)
models <- txhousing %>%
  group_by(city) %>%
  do(mod = lm(log2(sales) ~ factor(month), data = ., na.action = na.exclude))
models
## Source: local data frame [46 x 2]
## Groups: <by row>
##
## # A tibble: 46 x 2
## city mod
## * <chr> <list>
## 1 Abilene <S3: lm>
## 2 Amarillo <S3: lm>
## 3 Arlington <S3: lm>
## 4 Austin <S3: lm>
## 5 Bay Area <S3: lm>
## 6 Beaumont <S3: lm>
## 7 Brazoria County <S3: lm>
## 8 Brownsville <S3: lm>
## 9 Bryan-College Station <S3: lm>
## 10 Collin County <S3: lm>
## # ... with 36 more rows
model_sum <- models %>% glance(mod)
model_sum
## # A tibble: 46 x 12
## # Groups: city [46]
## city r.squared adj.r.squared sigma statistic
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene 0.5298893 0.5003394 0.2817299 17.932066
## 2 Amarillo 0.4493699 0.4147588 0.3015624 12.983427
## 3 Arlington 0.5132337 0.4826369 0.2673618 16.774130
## 4 Austin 0.4873283 0.4551032 0.3102974 15.122641
## 5 Bay Area 0.5552242 0.5272669 0.2646016 19.859696
## 6 Beaumont 0.4304326 0.3946312 0.2754690 12.022792
## 7 Brazoria County 0.5082062 0.4746054 0.2919275 15.124817
## 8 Brownsville 0.1714455 0.1187629 0.4196615 3.254307
## 9 Bryan-College Station 0.6511249 0.6291956 0.4061146 29.692015
## 10 Collin County 0.6009165 0.5758312 0.2658377 23.954970
## # ... with 36 more rows, and 7 more variables: p.value <dbl>, df <int>,
## # logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>
ggplot(model_sum, aes(AIC, reorder(city, AIC))) +
  geom_point()
Looking at AIC: Harlingen, San Marcos, and Brownsville have the highest AIC (two of the three are the same cities that had the lowest \(R^2\)). Dallas, El Paso, and Midland have the lowest AIC, which are different from those having the highest \(R^2\). (Recall that lower AIC is better.)
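As a sketch, the extreme-AIC city lists could also be derived from model_sum programmatically instead of being typed in by hand (assuming the plot was read correctly, these should match the hand-typed vectors used next):

```r
# Sketch: pull the three highest- and lowest-AIC cities from the model summary.
# model_sum is grouped by city, so ungroup() first or top_n() keeps every row.
worstAIC <- model_sum %>% ungroup() %>% top_n(3, AIC) %>% pull(city)
bestAIC  <- model_sum %>% ungroup() %>% top_n(-3, AIC) %>% pull(city)
```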
worstAIC <- c("Harlingen", "San Marcos", "Brownsville")
bestAIC <- c("Dallas", "El Paso", "Midland")
extreme <- txhousing %>%
  ungroup() %>%
  filter(city %in% c(worstAIC, bestAIC), !is.na(sales)) %>%
  mutate(city = factor(city, c(worstAIC, bestAIC)))
ggplot(extreme, aes(month, log(sales))) +
  geom_line(aes(group = year)) +
  facet_wrap(~city)
Looking at the plot on page 234, McAllen (low \(R^2\)) and Bryan-College Station (high \(R^2\)) have a similar order of magnitude of sales. This is also apparent when looking at a boxplot of their sales.
txhousing %>%
  group_by(city) %>%
  filter(city %in% c("McAllen", "Bryan-College Station"), !is.na(sales)) %>%
  summarize(meanSales = mean(sales))
## # A tibble: 2 x 2
## city meanSales
## <chr> <dbl>
## 1 Bryan-College Station 186.7433
## 2 McAllen 162.1730
selectCities <- txhousing %>%
  filter(city %in% c("McAllen", "Bryan-College Station"), !is.na(sales))
ggplot(selectCities, aes(city, sales)) + geom_boxplot()
If we look at all of the extreme cities, however, we do see that NE Tarrant County (high \(R^2\)) has significantly more sales. Also, Brownsville and Harlingen (both having low \(R^2\)) have fewer sales than the others.
selectCities <- txhousing %>%
  filter(city %in% c("McAllen", "Bryan-College Station", "Lubbock",
                     "NE Tarrant County", "Brownsville", "Harlingen"),
         !is.na(sales))
ggplot(selectCities, aes(city, sales)) + geom_boxplot()
log(sales) ~ factor(month) + year).

models <- txhousing %>%
  group_by(city) %>%
  do(mod = lm(log2(sales) ~ factor(month) + year, data = ., na.action = na.exclude))
models
## Source: local data frame [46 x 2]
## Groups: <by row>
##
## # A tibble: 46 x 2
## city mod
## * <chr> <list>
## 1 Abilene <S3: lm>
## 2 Amarillo <S3: lm>
## 3 Arlington <S3: lm>
## 4 Austin <S3: lm>
## 5 Bay Area <S3: lm>
## 6 Beaumont <S3: lm>
## 7 Brazoria County <S3: lm>
## 8 Brownsville <S3: lm>
## 9 Bryan-College Station <S3: lm>
## 10 Collin County <S3: lm>
## # ... with 36 more rows
model_sum <- models %>% glance(mod)
model_sum
## # A tibble: 46 x 12
## # Groups: city [46]
## city r.squared adj.r.squared sigma statistic
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene 0.6843255 0.6625548 0.2315246 31.433387
## 2 Amarillo 0.5313000 0.4989758 0.2790224 16.436631
## 3 Arlington 0.5773500 0.5482018 0.2498469 19.807350
## 4 Austin 0.6540551 0.6301968 0.2556268 27.414184
## 5 Bay Area 0.6608046 0.6374118 0.2317349 28.248216
## 6 Beaumont 0.5200054 0.4869023 0.2536079 15.708670
## 7 Brazoria County 0.5091646 0.4723519 0.2925529 13.831237
## 8 Brownsville 0.2355866 0.1822555 0.4042607 4.417429
## 9 Bryan-College Station 0.8510313 0.8407576 0.2661372 82.835859
## 10 Collin County 0.6889276 0.6674743 0.2353747 32.112942
## # ... with 36 more rows, and 7 more variables: p.value <dbl>, df <int>,
## # logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>
ggplot(model_sum, aes(r.squared, reorder(city, r.squared))) +
  geom_point()
Harlingen moved up 10 places (comparing the \(R^2\) values) by adding year to the model.
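Hypothetically, if the glance() summaries of the two models were kept as separate objects (say model_sum_month and model_sum_year — names not used in the original, which reuses model_sum), the rank change for every city could be computed in one join:

```r
# Sketch with hypothetical object names: compare each city's R^2 rank
# between the month-only model and the month + year model.
rankChange <- inner_join(
  model_sum_month %>% ungroup() %>% transmute(city, rank_month = rank(r.squared)),
  model_sum_year  %>% ungroup() %>% transmute(city, rank_year  = rank(r.squared)),
  by = "city"
) %>%
  mutate(places_moved = rank_year - rank_month) %>%
  arrange(desc(places_moved))
```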
models <- txhousing %>%
  group_by(city) %>%
  do(mod = lm(log2(sales) ~ factor(month), data = ., na.action = na.exclude))
model_sum <- models %>% glance(mod)
model_sum
## # A tibble: 46 x 12
## # Groups: city [46]
## city r.squared adj.r.squared sigma statistic
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene 0.5298893 0.5003394 0.2817299 17.932066
## 2 Amarillo 0.4493699 0.4147588 0.3015624 12.983427
## 3 Arlington 0.5132337 0.4826369 0.2673618 16.774130
## 4 Austin 0.4873283 0.4551032 0.3102974 15.122641
## 5 Bay Area 0.5552242 0.5272669 0.2646016 19.859696
## 6 Beaumont 0.4304326 0.3946312 0.2754690 12.022792
## 7 Brazoria County 0.5082062 0.4746054 0.2919275 15.124817
## 8 Brownsville 0.1714455 0.1187629 0.4196615 3.254307
## 9 Bryan-College Station 0.6511249 0.6291956 0.4061146 29.692015
## 10 Collin County 0.6009165 0.5758312 0.2658377 23.954970
## # ... with 36 more rows, and 7 more variables: p.value <dbl>, df <int>,
## # logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>
# Order the facets by r squared: set the factor levels of city in order of
# increasing r.squared (facet_wrap() follows the factor-level order).
cityOrdered <- model_sum %>% ungroup() %>% arrange(r.squared) %>% pull(city)
txhousingSorted <- left_join(txhousing, model_sum, by = "city") %>%
  ungroup() %>%
  mutate(city = factor(city, levels = cityOrdered))
ggplot(txhousingSorted, aes(month, log(sales))) +
  geom_line(aes(group = year)) +
  facet_wrap(~city)
coefs <- models %>% tidy(mod)
coefs
## # A tibble: 552 x 6
## # Groups: city [46]
## city term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene (Intercept) 6.5420182 0.07043248 92.883541 7.898616e-151
## 2 Abilene factor(month)2 0.3537962 0.09960657 3.551937 4.914925e-04
## 3 Abilene factor(month)3 0.6747930 0.09960657 6.774584 1.826729e-10
## 4 Abilene factor(month)4 0.7489369 0.09960657 7.518950 2.758003e-12
## 5 Abilene factor(month)5 0.9164846 0.09960657 9.201046 1.056158e-16
## 6 Abilene factor(month)6 1.0024412 0.09960657 10.064007 4.372570e-19
## 7 Abilene factor(month)7 0.9539842 0.09960657 9.577523 9.810940e-18
## 8 Abilene factor(month)8 0.9337577 0.10125307 9.222019 9.259347e-17
## 9 Abilene factor(month)9 0.6036668 0.10125307 5.961961 1.342427e-08
## 10 Abilene factor(month)10 0.6084391 0.10125307 6.009093 1.055654e-08
## # ... with 542 more rows
# Pull out the month coefficients:
months <- coefs %>%
  filter(grepl("factor", term)) %>%
  tidyr::extract(term, "month", "(\\d+)", convert = TRUE)
months
## # A tibble: 506 x 6
## # Groups: city [46]
## city month estimate std.error statistic p.value
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene 2 0.3537962 0.09960657 3.551937 4.914925e-04
## 2 Abilene 3 0.6747930 0.09960657 6.774584 1.826729e-10
## 3 Abilene 4 0.7489369 0.09960657 7.518950 2.758003e-12
## 4 Abilene 5 0.9164846 0.09960657 9.201046 1.056158e-16
## 5 Abilene 6 1.0024412 0.09960657 10.064007 4.372570e-19
## 6 Abilene 7 0.9539842 0.09960657 9.577523 9.810940e-18
## 7 Abilene 8 0.9337577 0.10125307 9.222019 9.259347e-17
## 8 Abilene 9 0.6036668 0.10125307 5.961961 1.342427e-08
## 9 Abilene 10 0.6084391 0.10125307 6.009093 1.055654e-08
## 10 Abilene 11 0.4189183 0.10125307 4.137339 5.449232e-05
## # ... with 496 more rows
# Strength of seasonal effect: maximum month coefficient per city
coefSummary <- months %>%
  group_by(city) %>%
  summarize(max = max(estimate))
# Extract the cities with the highest and lowest seasonal effect:
top3 <- coefSummary %>% top_n(3, max)
bottom3 <- coefSummary %>% top_n(-3, max)
selectCitySummary <- rbind(top3, bottom3)
# Plot their coefficients (back-transformed from the log2 scale):
ggplot(selectCitySummary, aes(2^max, reorder(city, max))) +
  geom_point() +
  ggtitle("Maximum 2^coefficient for cities\nwith highest and lowest seasonal effect")
There seems to be a positive relationship between \(R^2\) and seasonal effect:
# model_sum contains r.squared for each city
# coefSummary contains the max coefficient (seasonal effect)
combined <- inner_join(model_sum, coefSummary, by = "city")
combined
## # A tibble: 46 x 13
## # Groups: city [?]
## city r.squared adj.r.squared sigma statistic
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene 0.5298893 0.5003394 0.2817299 17.932066
## 2 Amarillo 0.4493699 0.4147588 0.3015624 12.983427
## 3 Arlington 0.5132337 0.4826369 0.2673618 16.774130
## 4 Austin 0.4873283 0.4551032 0.3102974 15.122641
## 5 Bay Area 0.5552242 0.5272669 0.2646016 19.859696
## 6 Beaumont 0.4304326 0.3946312 0.2754690 12.022792
## 7 Brazoria County 0.5082062 0.4746054 0.2919275 15.124817
## 8 Brownsville 0.1714455 0.1187629 0.4196615 3.254307
## 9 Bryan-College Station 0.6511249 0.6291956 0.4061146 29.692015
## 10 Collin County 0.6009165 0.5758312 0.2658377 23.954970
## # ... with 36 more rows, and 8 more variables: p.value <dbl>, df <int>,
## # logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>,
## # max <dbl>
ggplot(combined, aes(r.squared, 2^max)) +
  geom_point() +
  labs(y = "Max 2^coef (seasonal effect)")
Hypothesis: Cities with strongest seasonal effect are college and beach towns.
It can sometimes be difficult to draw conclusions about cause. In this case we could perform a hypothesis test comparing the seasonal effect of beach and college towns vs. all other towns, to see if there is a statistically significant difference.
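A minimal sketch of such a test, assuming coefSummary from above and a hand-labeled list of college/beach towns (the list is an illustrative assumption, not a definitive classification):

```r
# Sketch: two-sample t-test of seasonal effect, college/beach towns vs the rest.
# The town labels below are an illustrative assumption.
collegeBeach <- c("Bryan-College Station", "Lubbock", "Galveston", "South Padre Island")
testData <- coefSummary %>%
  mutate(group = ifelse(city %in% collegeBeach, "college/beach", "other"))
t.test(max ~ group, data = testData)
```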
log(price) ~ log(carat). What does the intercept tell you? What does the slope tell you? How do the slope and intercept vary across the groups? Answer with a plot.

The slope tells us how log(price) changes with log(carat) within each cut, clarity, and color group. The intercept tells us the predicted log(price) when log(carat) = 0, i.e. the price of a one-carat diamond in that group.
Next we’ll look at how the slope and intercept vary across groups.
models <- diamonds %>%
  group_by(cut, clarity, color) %>%
  do(mod = lm(
    log(price) ~ log(carat),
    data = .,
    na.action = na.exclude
  ))
models
## Source: local data frame [276 x 4]
## Groups: <by row>
##
## # A tibble: 276 x 4
## cut clarity color mod
## * <ord> <ord> <ord> <list>
## 1 Fair I1 D <S3: lm>
## 2 Fair I1 E <S3: lm>
## 3 Fair I1 F <S3: lm>
## 4 Fair I1 G <S3: lm>
## 5 Fair I1 H <S3: lm>
## 6 Fair I1 I <S3: lm>
## 7 Fair I1 J <S3: lm>
## 8 Fair SI2 D <S3: lm>
## 9 Fair SI2 E <S3: lm>
## 10 Fair SI2 F <S3: lm>
## # ... with 266 more rows
models %>% glance(mod)
## # A tibble: 276 x 14
## # Groups: cut, clarity, color [276]
## cut clarity color r.squared adj.r.squared sigma statistic
## <ord> <ord> <ord> <dbl> <dbl> <dbl> <dbl>
## 1 Fair I1 D 0.9937512 0.9906267 0.07378993 318.059271
## 2 Fair I1 E 0.4701053 0.3944061 0.30213442 6.210172
## 3 Fair I1 F 0.8560433 0.8516810 0.32529833 196.235634
## 4 Fair I1 G 0.9738591 0.9733466 0.13199534 1899.969075
## 5 Fair I1 H 0.9237048 0.9221789 0.19958859 605.349118
## 6 Fair I1 I 0.9469761 0.9453191 0.13809815 571.501803
## 7 Fair I1 J 0.9626127 0.9608324 0.15083441 540.688367
## 8 Fair SI2 D 0.9133422 0.9117374 0.19026661 569.140643
## 9 Fair SI2 E 0.9578375 0.9572827 0.13575207 1726.549900
## 10 Fair SI2 F 0.9313385 0.9305493 0.18134425 1180.085357
## # ... with 266 more rows, and 7 more variables: p.value <dbl>, df <int>,
## # logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>
slope <- models %>% tidy(mod) %>% filter(grepl("carat", term))
slope
## # A tibble: 271 x 8
## # Groups: cut, clarity, color [271]
## cut clarity color term estimate std.error statistic
## <ord> <ord> <ord> <chr> <dbl> <dbl> <dbl>
## 1 Fair I1 D log(carat) 1.398200 0.07839988 17.834216
## 2 Fair I1 E log(carat) 1.242414 0.49855648 2.492022
## 3 Fair I1 F log(carat) 1.496369 0.10681933 14.008413
## 4 Fair I1 G log(carat) 1.623653 0.03724947 43.588635
## 5 Fair I1 H log(carat) 1.603171 0.06515939 24.603844
## 6 Fair I1 I log(carat) 1.652808 0.06913749 23.906104
## 7 Fair I1 J log(carat) 1.397439 0.06009792 23.252707
## 8 Fair SI2 D log(carat) 1.783365 0.07475331 23.856669
## 9 Fair SI2 E log(carat) 1.826243 0.04395103 41.551774
## 10 Fair SI2 F log(carat) 1.659332 0.04830327 34.352370
## # ... with 261 more rows, and 1 more variables: p.value <dbl>
intercept <- models %>% tidy(mod) %>% filter(grepl("Intercept", term))
intercept
## # A tibble: 276 x 8
## # Groups: cut, clarity, color [276]
## cut clarity color term estimate std.error statistic
## <ord> <ord> <ord> <chr> <dbl> <dbl> <dbl>
## 1 Fair I1 D (Intercept) 7.962306 0.05477466 145.3648
## 2 Fair I1 E (Intercept) 7.644798 0.10398708 73.5168
## 3 Fair I1 F (Intercept) 7.654878 0.05625926 136.0643
## 4 Fair I1 G (Intercept) 7.613672 0.01838181 414.1959
## 5 Fair I1 H (Intercept) 7.605013 0.03418893 222.4408
## 6 Fair I1 I (Intercept) 7.630474 0.02809506 271.5949
## 7 Fair I1 J (Intercept) 7.624676 0.04572844 166.7382
## 8 Fair SI2 D (Intercept) 8.243521 0.02560794 321.9127
## 9 Fair SI2 E (Intercept) 8.206757 0.01550199 529.4001
## 10 Fair SI2 F (Intercept) 8.174321 0.01922257 425.2459
## # ... with 266 more rows, and 1 more variables: p.value <dbl>
# Plot slope estimates:
ggplot(slope, aes(clarity, estimate, color = color)) +
  geom_point() +
  facet_wrap(~cut) +
  labs(title = "Slope") + ylab("Estimate of slope")
# Plot intercept estimates:
ggplot(intercept, aes(clarity, estimate, color = color)) +
  geom_point() +
  facet_wrap(~cut) +
  labs(title = "Intercept") + ylab("Estimate of intercept")
.fitted) vs. residuals (.resid). Do you see any patterns? What if you include the city or month on the same plot?

txhousing <- txhousing %>%
  mutate(id = row_number())
models <- txhousing %>%
  group_by(city) %>%
  do(mod = lm(log2(sales) ~ factor(month), data = ., na.action = na.exclude))
obsSummary <- models %>% augment(mod)
## Warning in augment_columns(x, data, newdata, type.predict = type.predict, :
## When fitting with na.exclude, rows with NA in original data will be dropped
## unless those rows are provided in 'data' argument
obsSummary
## # A tibble: 8,034 x 13
## # Groups: city [46]
## city log2.sales. factor.month. .fitted .se.fit .resid
## <chr> <dbl> <fctr> <dbl> <dbl> <dbl>
## 1 Abilene 6.169925 1 6.542018 0.07043248 -0.3720932
## 2 Abilene 6.614710 2 6.895814 0.07043248 -0.2811046
## 3 Abilene 7.022368 3 7.216811 0.07043248 -0.1944434
## 4 Abilene 6.614710 4 7.290955 0.07043248 -0.6762452
## 5 Abilene 7.139551 5 7.458503 0.07043248 -0.3189515
## 6 Abilene 7.285402 6 7.544459 0.07043248 -0.2590572
## 7 Abilene 7.247928 7 7.496002 0.07043248 -0.2480749
## 8 Abilene 7.033423 8 7.475776 0.07274235 -0.4423529
## 9 Abilene 6.700440 9 7.145685 0.07274235 -0.4452453
## 10 Abilene 6.658211 10 7.150457 0.07274235 -0.4922458
## # ... with 8,024 more rows, and 7 more variables: .hat <dbl>,
## # .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>, .rownames <chr>,
## # `log2(sales)` <dbl>, `factor(month)` <fctr>
p <- ggplot(obsSummary, aes(.fitted, .resid)) +
  geom_point()
p
In the plot above we see the variance of the residuals shrink as the fitted values get larger. The plot looks much better when we facet by month (though the variation of the residuals appears high in the NA facet):
p + facet_wrap(~factor.month.)
#p + facet_wrap(~city)
log(sales) for each city. Highlight points that have a standardized residual of greater than 2.

largeResid <- obsSummary %>% filter(abs(.std.resid) > 2)
largeResid
## # A tibble: 268 x 13
## # Groups: city [43]
## city log2.sales. factor.month. .fitted .se.fit .resid
## <chr> <dbl> <fctr> <dbl> <dbl> <dbl>
## 1 Abilene 6.614710 4 7.290955 0.07043248 -0.6762452
## 2 Abilene 6.714246 4 7.290955 0.07043248 -0.5767095
## 3 Abilene 7.303781 1 6.542018 0.07043248 0.7617625
## 4 Abilene 8.066089 7 7.496002 0.07043248 0.5700868
## 5 Amarillo 6.672425 1 7.282885 0.07539060 -0.6104594
## 6 Amarillo 7.303781 9 7.891996 0.07786308 -0.5882150
## 7 Amarillo 6.066089 10 7.718550 0.07786308 -1.6524607
## 8 Amarillo 7.507795 7 8.092723 0.07539060 -0.5849281
## 9 Arlington 9.491853 6 8.958743 0.06684046 0.5331099
## 10 Arlington 9.481799 8 8.942010 0.06903253 0.5397894
## # ... with 258 more rows, and 7 more variables: .hat <dbl>, .sigma <dbl>,
## # .cooksd <dbl>, .std.resid <dbl>, .rownames <chr>, `log2(sales)` <dbl>,
## # `factor(month)` <fctr>
ggplot(txhousing, aes(date, log(sales))) +
  geom_line(aes(group = city))
## Warning: Removed 430 rows containing missing values (geom_path).
Need to link augment results back up with original dataset. See broom issue discussion: https://github.com/tidyverse/broom/issues/168
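One possible fix, suggested by the warning text itself ("unless those rows are provided in 'data' argument"): pass the original data into augment() so the fitted columns line up with date and city, then overlay the large residuals. A sketch, not verified against this version of broom:

```r
# Sketch: keep the original columns through augment() by supplying data = .,
# then highlight the points with |standardized residual| > 2 in red.
obsSummary2 <- txhousing %>%
  group_by(city) %>%
  do(augment(lm(log2(sales) ~ factor(month), data = ., na.action = na.exclude),
             data = .))
ggplot(obsSummary2, aes(date, log2(sales))) +
  geom_line(aes(group = city), alpha = 1/5) +
  geom_point(data = filter(obsSummary2, abs(.std.resid) > 2),
             color = "red", size = 0.8)
```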